Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot benefit, or benefit only marginally, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, and sequential distillation, revealing that: 1) Distilling token relations is more effective than CLS-token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) Weak regularization is preferred. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification with the ViT-Tiny, ViT-Small, and ViT-Base models, obtaining +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models: exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
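To make the token-relation target concrete, below is a minimal PyTorch sketch (not the official TinyMIM code; the temperature, head count, and tensor shapes are illustrative assumptions) of matching a student's softmax-normalized Q-K token relations to a teacher's with a KL divergence.

```python
import torch
import torch.nn.functional as F

def relation_logits(q, k):
    """Token-to-token Q-K relation logits, shape (B, heads, N, N)."""
    return q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5

def relation_distill_loss(student_q, student_k, teacher_q, teacher_k, temperature=1.0):
    """KL divergence between teacher and student token-relation maps."""
    s_log = F.log_softmax(relation_logits(student_q, student_k) / temperature, dim=-1)
    t_prob = F.softmax(relation_logits(teacher_q, teacher_k) / temperature, dim=-1)
    return F.kl_div(s_log, t_prob, reduction="batchmean")

# Dummy example: batch 2, 4 heads, 197 tokens (196 patches + CLS), head dim 64.
B, H, N, D = 2, 4, 197, 64
loss = relation_distill_loss(
    torch.randn(B, H, N, D), torch.randn(B, H, N, D),
    torch.randn(B, H, N, D), torch.randn(B, H, N, D),
)
```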
Data lies at the core of modern deep learning. The impressive performance of supervised learning is built on a foundation of massive amounts of accurately labeled data. However, in some real-world applications accurate labeling may not be feasible; instead, multiple annotators provide multiple noisy labels (rather than one accurate label) for each data sample. Learning a classifier on such a noisy training dataset is a challenging task. Previous methods usually assume that all data samples share the same set of parameters related to annotator errors, whereas we demonstrate that label-error modeling should be both annotator- and data-sample-dependent. Motivated by this observation, we propose a novel learning algorithm. The proposed method shows advantages over several state-of-the-art baselines on MNIST, CIFAR-100, and ImageNet-100. Our code is available at: https://github.com/zhengqigao/learning-from-multiple-annotator-noisy-labels.
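As one way to picture annotator- and sample-dependent label-error modeling, here is a minimal sketch (not the authors' algorithm; the reliability head, fusion rule, and dimensions are hypothetical): a small head predicts a per-sample reliability weight for each annotator and fuses the noisy labels into a soft training target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLabelFusion(nn.Module):
    def __init__(self, feat_dim, num_annotators, num_classes):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Reliability depends on the sample feature: one logit per annotator.
        self.reliability = nn.Linear(feat_dim, num_annotators)

    def forward(self, features, noisy_labels):
        # features: (B, feat_dim); noisy_labels: (B, A) integer labels, one per annotator.
        logits = self.classifier(features)                            # (B, C)
        w = F.softmax(self.reliability(features), dim=-1)             # (B, A), sample-dependent
        one_hot = F.one_hot(noisy_labels, logits.shape[-1]).float()   # (B, A, C)
        soft_target = (w.unsqueeze(-1) * one_hot).sum(dim=1)          # (B, C) fused soft label
        # Soft-label cross-entropy (requires PyTorch >= 1.10 for probability targets).
        loss = F.cross_entropy(logits, soft_target)
        return loss, logits
```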
A key challenge of exemplar-guided image generation lies in establishing fine-grained correspondences between the input image and the guiding exemplar. Despite encouraging results, prior approaches rely on dense attention to compute per-point matching. In this paper, we propose a dynamic sparse attention based Transformer model, termed Dynamic Sparse Transformer (DynaST), to achieve fine-grained matching with favorable efficiency. At the heart of our approach is a novel dynamic-attention unit dedicated to covering the variation in the optimal number of tokens to attend to. Specifically, DynaST leverages the multi-layer nature of the Transformer structure and performs the dynamic attention scheme in a cascaded manner to refine matching results and synthesize visually pleasing outputs. In addition, we introduce a unified training objective for DynaST, making it a versatile reference-based image translation framework for both supervised and unsupervised settings. Extensive experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details, outperforming the state of the art while reducing computational cost. Our code is available at https://github.com/huage001/dynast
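To illustrate the general idea of sparse matching, the sketch below implements simple top-k sparse cross-attention in PyTorch; the fixed top-k budget is only an illustrative stand-in for DynaST's dynamically estimated token count, and this is not the official implementation.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=16):
    """q: (B, Nq, D) queries from the input image; k, v: (B, Nk, D) from the exemplar.
    Each query attends only to its top_k best-matching exemplar tokens (top_k <= Nk)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, Nq, Nk) matching scores
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)     # keep only the best matches
    attn = F.softmax(topk_scores, dim=-1)                  # (B, Nq, top_k)
    # Gather the corresponding value vectors and aggregate them.
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1])            # (B, Nq, top_k, D)
    v_sel = torch.gather(v.unsqueeze(1).expand(-1, q.shape[1], -1, -1), 2, idx)
    return (attn.unsqueeze(-1) * v_sel).sum(dim=2)         # (B, Nq, D)

out = topk_sparse_attention(torch.randn(2, 256, 64), torch.randn(2, 256, 64), torch.randn(2, 256, 64))
```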
Data mixing (e.g., Mixup, CutMix, ResizeMix) is an essential component for advancing recognition models. In this paper, we focus on studying its effectiveness in the self-supervised setting. Noticing that mixed images that share the same source images are intrinsically related to each other, we propose SDMP, short for Simple Data Mixing Prior, to capture this straightforward yet essential prior and position such mixed images as additional positive pairs to facilitate self-supervised representation learning. Our experiments verify that the proposed SDMP enables data mixing to help a set of self-supervised learning frameworks (e.g., MoCo) achieve better accuracy and out-of-distribution robustness. More notably, our SDMP is the first method that successfully leverages data mixing to improve (rather than hurt) the performance of Vision Transformers in the self-supervised setting. Code is publicly available at https://github.com/oliverrensu/sdmp
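Below is a minimal sketch of the data-mixing prior (not the official SDMP code): two mixed views that share a dominant source image are treated as an extra positive pair, here with a simple cosine-similarity objective; the encoder, mixing ratios, and loss form are placeholders.

```python
import torch
import torch.nn.functional as F

def mix_batch(x, lam):
    """Mixup within a batch: blend each image with a randomly permuted partner."""
    perm = torch.randperm(x.shape[0], device=x.device)
    return lam * x + (1.0 - lam) * x[perm], perm

def mixing_prior_loss(encoder, x, lam1=0.8, lam2=0.6):
    view1, _ = mix_batch(x, lam1)
    view2, _ = mix_batch(x, lam2)
    z1 = F.normalize(encoder(view1), dim=-1)
    z2 = F.normalize(encoder(view2), dim=-1)
    # Mixed images at the same batch index share a dominant source image,
    # so their representations are pulled together as an additional positive pair.
    return -(z1 * z2).sum(dim=-1).mean()
```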
Multimodal knowledge distillation (KD) extends traditional knowledge distillation to the domain of multimodal learning. A common practice is to adopt a well-performing multimodal network as the teacher in the hope that it transfers its full knowledge to a unimodal student and boosts performance. In this paper, we investigate the efficacy of multimodal KD. We first present two failure cases and demonstrate that KD is not a universal cure for multimodal knowledge transfer. We then introduce the modality Venn diagram to understand modality relationships, together with the modality focusing hypothesis, which reveals the decisive factor in the efficacy of multimodal KD. Experimental results on 6 multimodal datasets help justify our hypothesis, diagnose the failure cases, and point out directions for improving distillation performance.
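For reference, the common practice examined above can be written as the standard temperature-scaled KD objective below (a generic sketch, not the paper's analysis; the temperature and mixing weight are illustrative).

```python
import torch
import torch.nn.functional as F

def multimodal_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-entropy on the labels plus a temperature-scaled KL term toward the
    multimodal teacher's softened logits (the student sees only one modality)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return alpha * ce + (1.0 - alpha) * kd
```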
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually designate a similar receptive field for each token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, leading to performance degradation on images containing multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer. The key idea of SSA is to inject heterogeneous receptive field sizes into tokens: before computing the self-attention matrix, it selectively merges tokens to represent larger object features while keeping certain tokens to preserve fine-grained features. This novel merging scheme enables self-attention to learn relationships between objects of different sizes while simultaneously reducing the token count and the computational cost. Extensive experiments across various tasks demonstrate the superiority of SSA. Specifically, the SSA-based Transformer achieves 84.0% top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet with only half of the model size and computational cost, and surpasses Focal Transformer by 1.3 mAP on COCO and 2.9 mIoU on ADE20K under similar parameter and computational costs. Code has been released at https://github.com/oliverrensu/shunted-transformer.
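The sketch below illustrates the token-merging idea in a simplified form (not the released Shunted Transformer code): keys and values are average-pooled at different rates for different head groups, so some heads attend to coarse, merged tokens while others keep fine-grained ones; the linear projections are omitted and the merge rates are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, h, w, rate):
    """Merge tokens by average-pooling the (h, w) token grid with the given rate."""
    B, N, C = x.shape
    grid = x.transpose(1, 2).reshape(B, C, h, w)
    pooled = F.avg_pool2d(grid, kernel_size=rate, stride=rate)
    return pooled.flatten(2).transpose(1, 2)                 # (B, N / rate^2, C)

def shunted_attention(x, h, w, num_heads=4, rates=(1, 2)):
    """x: (B, N, C) with N = h * w; each head group uses its own key/value merge rate."""
    B, N, C = x.shape
    d = C // num_heads
    q = x.reshape(B, N, num_heads, d).transpose(1, 2)        # (B, heads, N, d)
    heads_per_group = num_heads // len(rates)
    outs = []
    for g, rate in enumerate(rates):
        kv = merge_tokens(x, h, w, rate) if rate > 1 else x
        M = kv.shape[1]
        sl = slice(g * heads_per_group, (g + 1) * heads_per_group)
        k = kv.reshape(B, M, num_heads, d).transpose(1, 2)[:, sl]
        v = kv.reshape(B, M, num_heads, d).transpose(1, 2)[:, sl]
        attn = F.softmax(q[:, sl] @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        outs.append(attn @ v)                                 # (B, heads_per_group, N, d)
    return torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)

out = shunted_attention(torch.randn(2, 14 * 14, 64), h=14, w=14)
```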
This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. The algorithm is based on a reduced-space branch-and-bound scheme and guarantees convergence to the global optimum in a finite number of steps by branching only on the region of centers. To improve efficiency, we design a two-stage decomposable lower bound whose solution can be derived in closed form. In addition, we propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets demonstrate that our algorithm can solve K-center problems to global optimality within 4 hours for ten million samples in serial mode and one billion samples in parallel mode. Moreover, compared with state-of-the-art heuristic methods, the global optimum obtained by our algorithm reduces the objective function by 25.8% on average over all the synthetic and real-world datasets.
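For context, the objective being minimized and the classical greedy farthest-first baseline (a 2-approximation) look like the sketch below; this is not the paper's branch-and-bound algorithm, which additionally certifies the global optimum.

```python
import numpy as np

def kcenter_objective(X, centers):
    """Maximum distance from any sample to its nearest selected center."""
    d = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=-1)  # (n, K)
    return d.min(axis=1).max()

def greedy_kcenter(X, K, seed=0):
    """Farthest-first traversal: a classical 2-approximation for K-center."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(K - 1):
        nxt = int(dist.argmax())                      # farthest remaining point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers

X = np.random.default_rng(0).normal(size=(1000, 2))
print(kcenter_objective(X, greedy_kcenter(X, K=5)))
```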
Score-based diffusion models have captured widespread attention and fueled fast progress in recent vision generative tasks. In this paper, we focus on the diffusion model backbone, which has been largely neglected before. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements, the performance of a vanilla ViT-based backbone (IU-ViT) is boosted to be on par with that of traditional U-Net-based methods. We further provide a hypothesis on the implication of disentangling the generative backbone into an encoder-decoder structure and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with the ASymmetriC ENcoder-Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird, and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on a text-to-image task beyond 64x64 resolution. We hope this motivates people to rethink the modeling choices and training pipelines for diffusion-based generative models.
Deep learning-based methods have achieved significant performance on image defogging. However, existing methods are mainly developed for land scenes and perform poorly on overwater foggy images, since overwater scenes typically contain large expanses of sky and water. In this work, we propose a Prior map Guided CycleGAN (PG-CycleGAN) for defogging images of overwater scenes. To promote the recovery of objects on the water, two loss functions are exploited for the network, in which a prior map is designed by inverting the dark channel and applying min-max normalization to suppress the sky and emphasize objects. However, due to the unpaired training set, the network may learn an under-constrained domain mapping from foggy to fog-free images, leading to artifacts and loss of detail. Thus, we propose an intuitive Upscaling Inception Module (UIM) and a Long-range Residual Coarse-to-fine framework (LRC) to mitigate this issue. Extensive qualitative and quantitative comparisons demonstrate that the proposed method outperforms state-of-the-art supervised, semi-supervised, and unsupervised defogging approaches.
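Here is a minimal sketch of the prior map described above, assuming RGB images scaled to [0, 1] and a 15x15 patch for the dark channel (the exact formulation in the paper may differ): the dark channel is inverted and then min-max normalized per image, so bright, hazy sky regions are suppressed while objects on the water stand out.

```python
import torch
import torch.nn.functional as F

def prior_map(img, patch=15):
    """img: (B, 3, H, W) in [0, 1]. Returns a (B, 1, H, W) prior map in [0, 1]."""
    # Dark channel: per-pixel minimum over RGB, then a local minimum filter
    # (implemented as -max_pool2d(-x)) over a patch x patch neighborhood.
    dark = img.min(dim=1, keepdim=True).values
    dark = -F.max_pool2d(-dark, kernel_size=patch, stride=1, padding=patch // 2)
    inverted = 1.0 - dark                              # invert the dark channel
    # Min-max normalization per image: sky (high dark channel) maps to low values.
    flat = inverted.flatten(1)
    mn = flat.min(dim=1).values.view(-1, 1, 1, 1)
    mx = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return (inverted - mn) / (mx - mn + 1e-8)

p = prior_map(torch.rand(2, 3, 128, 128))
```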
Conversational recommender systems (CRSs) often utilize external knowledge graphs (KGs) to introduce rich semantic information and recommend relevant items through natural language dialogues. However, the original KGs employed in existing CRSs are often incomplete and sparse, which limits the reasoning capability for recommendation. Moreover, only a few existing studies exploit the dialogue context to dynamically refine knowledge from KGs for better recommendation. To address these issues, we propose the Variational Reasoning over Incomplete KGs Conversational Recommender (VRICR). Our key idea is to incorporate the large dialogue corpus that naturally accompanies CRSs to enhance the incomplete KGs, and to perform dynamic knowledge reasoning conditioned on the dialogue context. Specifically, we treat the dialogue-specific subgraphs of the KGs as latent variables with categorical priors for adaptive knowledge graph refactoring. We propose a variational Bayesian method to approximate the posterior distributions over dialogue-specific subgraphs, which not only leverages the dialogue corpus for restructuring missing entity relations but also dynamically selects knowledge based on the dialogue context. Finally, we infuse the dialogue-specific subgraphs to decode recommendations and responses. Experiments on two benchmark CRS datasets confirm the effectiveness of our proposed method.
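To make the latent-subgraph idea concrete, here is an illustrative sketch (not the VRICR implementation; the scoring function, prior, and shapes are assumptions): each candidate edge is kept or dropped via a Gumbel-softmax posterior conditioned on the dialogue context, with a KL term toward a categorical prior.

```python
import torch
import torch.nn.functional as F

def sample_subgraph(edge_feats, dialogue_ctx, prior_keep=0.5, tau=0.5):
    """edge_feats: (E, D) candidate-edge features; dialogue_ctx: (D,) context vector.
    Returns a differentiable 0/1 edge mask and the KL term to the categorical prior."""
    logits_keep = edge_feats @ dialogue_ctx                      # (E,) posterior keep scores
    logits = torch.stack([logits_keep, -logits_keep], dim=-1)    # (E, 2): keep vs. drop
    mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]  # (E,) sampled edge mask
    post = F.softmax(logits, dim=-1)
    prior = torch.tensor([prior_keep, 1.0 - prior_keep])
    kl = (post * (post.clamp_min(1e-8).log() - prior.log())).sum(-1).mean()
    return mask, kl

mask, kl = sample_subgraph(torch.randn(100, 32), torch.randn(32))
```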